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Abstract. This paper is devoted to minimizing the sum of a smooth function and a nonsmooth ^i-rcgularized term. This 
problem as a special cases includes the l\ -regularized convex minimization problem in signal processing, compressive sensing, 
machine learning, data mining, etc. However, the non-diffcrcntiability of the ^i-norm causes more challenging especially 
in large problems encountered in many practical applications. This paper proposes, analyzes, and tests a Barzilai-Borwein 
gradient algorithm. At each iteration, the generated search direction enjoys descent property and can be easily derived by 
minimizing a local approximal quadratic model and simultaneously taking the favorable structure of the £i-norm. Moreover, 
a nonmonotone line search technique is incorporated to find a suitable stcpsizc along this direction. The algorithm is easily 
performed, where the values of the objective function and the gradient of the smooth term are required at per-itcration. Under 
some conditions, the proposed algorithm is shown to be globally convergent. The limited experiments by using some nonconvex 
unconstrained problems from CUTEr library with additive ^i-regularization illustrate that the proposed algorithm performs 
quite well. Extensive experiments for ^i-rcgularized least squares problems in compressive sensing verify that our algorithm 
compares favorably with several state-of-the-art algorithms which are specifically designed in recent years. 
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1. Introduction. The focus of this paper is on the following structured minimization 



where / : R™ — > R is a continuously diffcrentiablc (may be nonconvex) function that is bounded below; || ■ \\i 
denotes the ^i-norm of a vector; parameter fx > is used to trade off both terms for minimization. Due 
to its structure, problem covers a wide range of apparently related formulations in different scientific 
fields including linear inverse problem, signal/image processing, compressive sensing, and machine learning. 

1.1. Problem formulations. A popular special case of model (|l.lj) is the ^i-norm regularized least 
square problem 



where A £ M. mxn (jn -C n) is a linear operator, and b 6 R m is an observation. Model (jl.2l) mainly appears 
in compressive sensing — an emerging methodology in digital signal processing, and has attracted intensive 
research activities over the past years [TTJ [12j QUI HSl [20] . Compressive sensing is based on the fact that if 
the original signal is sparse or approximately sparse in some orthogonal basis, an exact restoration can be 
produced via solving problem (|1.2[) . 
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min F(x) = f(x) + fj,\\x\\i, 



(1.1) 



m jE \\ Ax ~ b Wl + MN|l, 



(1.2) 
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Another prevalent case of that has been achieved much interest in machine learning is the linear 
and logistic regression. Given the training date A = [ai, • • ■ , a m ] T <G M. mxn and class labels y 6 { — 1, +l} m - 
A linear classifier is a hyperplane {wi : x 1 ai + b = 0}, where x £ l n is a set of weights and b € K is the 
intercept. A frequently used model is the ^2-loss support vector machine 

m 

min y maxjO, 1 — yAx 1 on + b)} 2 + ullxlli, (1-3) 

i—X 

Because of the "max" operation, the ^2-loos function is continuous, but not differcntiablc. Based on the 
conditional probability, another popular model is the logistic regression 



mm 



Vlog(l + e-(* TQ *+ b ^) + M |M|i. (1.4) 

i=l 

Obviously, the logistic loss function is twice diffcrentiable. 

Although the models of these problems have similar structures, they may be very different from real-data 
point of view. For example, in compressive sensing, the length of measurement m is much smaller than the 
length of original signal (m <C n) and the encoding matrix A is dense. However, in machine learning, the 
numbers of instance m and features n are both large and the data A is very sparse. 

1.2. Existing algorithms. Since the ^-regularized term is non-diffcrentiable when x contains values 
of zero, the use of the standard unconstrained smooth optimization tools are generally precluded. In the 
past decades, a wide variety of approaches has been proposed, analyzed, and implemented in compressive 
sensing and machine learning literatures. This includes a variety of algorithms for special cases where fix) 
has a specific functional form such as the least square (|1.2[) . the square loss (|1.3[) and the logistic loss p. 41) . 
In the following, we briefly review some of them in each literature. 

The first popular approach falls into the coordinate descent method. At the current iterate Xk, the 
simple coordinate descent method updates one component at a time to generate x? k , j = 1, . . . , n + 1, such 
that x\ = Xk, %k = x k+i: an d solves a one-dimensional subproblcm 

min F(x{ + ze j ) - F(x{), (1.5) 

where e J is defined as the j-th column of an identity matrix. Clearly, the objective function has one variable, 
and one non-differentiable point at z = — eK To solve the logistic regression model (|1.4p . BBR [23] solves the 
sub-problem approximately by the use of trust region method with Newton step; CDN |14j improves BBR's 
performance by applying a one-dimensional Newton method and a line search technique. Instead of cyclically 
updating one component at each time, the stochastic coordinate descent method |35j randomly selects the 
working components to attain better performance; the block coordinate gradient descent algorithm — CGD 
|381 143j is based on the approximated convex quadratic model for /, and selects the working variables with 
some rules. 

The second type of approach is to transform model into an equivalent box-constrained optimization 
problem by variable splitting. Let x = u — v with Ui = max{0, Xi) and Vi = max{0, —Xi}. Then, model (jl.ip 
can be reformulated cquivalently as 

n 

min f(u — v) + /i > (ui + Vi), s.t. u > 0, v > 0. (1-6) 

u,v — * 

i=l 

The objective function and constraints are smooth, and therefore, it can be solved by any standard box- 
constrained optimization technique. However, an obvious drawback of this approach is that it doubles 
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the number of variables. GPSR [22] solves (|1.6[) . and subsequently solves (|1.2[) . by using Barzilai-Borwein 
gradient method [2] with an efficient nonmonotone line search [24] . It is actually an application of the well- 
known spectral projection gradient [5] in compressive sensing. Trust region Newton algorithm — TRON 
[2"51 |4*2] minimizes (|1.6[) . then solves the logistic regression model (|1.4p . and exhibits powerful vitality by 
a series of comparisons. To solve (|1.3j) and (|1.4|) . the interior-point algorithm [26] [27] forms a sequence of 
unconstrained approximations by appending a 'barrier' function to the objective function (|1.6[) which ensures 
that it and i> remain sufficiently positive. Moreover, truncated Newton steps and preconditioned conjugate 
gradient iterations are used to produce the search direction. 

The third type of method is to approximate the ^-regularized term with a diffcrentiablc function. The 
simple approach replaces the £i-norm with a sum of multi-quadric functions 

n 

i 

where e is a small positive scalar. This function is twice-diffcrcntiable and lim e _>Q+ l{x) = \\x\\\. Subsequently, 
several smooth unconstrained optimization approaches can be applied, based on this approximation. How- 
ever, the performance of these algorithms is much influenced by the parameter values, and the condition 
number of the corresponding Hessian matrix becomes larger as e decreases. The Nesterov's smoothing tech- 
nique [29] is to construct smooth functions to approximate any general convex nonsmooth function. Based 
on this technique, NESTA 0] solves problem (|1.2p by using first-order gradient information. 

The fourth type of approach falls into the subgradient-based Newton-type algorithm. The important 
attempt in this class is from Andrew and Gao [T] , who extend the well-known limited memory BFGS method 
[31] to solve ^i-regularized logistic regression model (|1.4[) . and propose an orthant-wise limited memory quasi- 
Newton method — OWL-QN. At each iteration, this method computes a search direction over an orthant 
containing the previous point. The subspace BFGS method — subBFGS [41] involves an inner iteration 
approach to find the descent quasi-Newton direction and a sub-gradient Wolfe-condtions to determine the 
stcpsize which ensures that the objective functions are decreasing. This method enjoys global convergence 
and is capable of solving general nonsmooth convex minimization problems. 

Finally, to solve model (]1.2j) . besides GPSR and NESTA, there are other numerous specially designed 
solvers. By an operator splitting technique, Hale, Yin and Zhang derive the iterative shrinkage/thresholding 
fixed-point continuation algorithm (FPC) [2S]. By combining the interior-point algorithm in [26], FPC is 
also extended to solve large-scale ^-regularized logistic regression in [36]. TwIST [7] and FISTA [3] speed 
up the performance of 1ST and have virtually the same complexity but with better convergence properties. 
Another closely related method is the sparse reconstruction algorithm SpaRSA [39] , which is to minimize 
non-smooth convex problem with separable structures. SPGL1 [5] solves the lasso model (|1.2p by the spectral 
gradient projection method with an efficient Euclidean projection on £i-norm ball. The alternating directions 
method — YALL1 [40 , investigates £i-norm problems from either the primal or the dual forms and solves 
^i-regularized problems with different types. 

All the reviewed algorithms differ in various aspects such as the convergence speed, ease of implemen- 
tation, and practical applicability. Moreover, there is no enough evidence to verify that which algorithm 
outperforms the others under all scenarios. 

1.3. Contributions and organization. Although much progress has been achieved in solving the 
problem (|1.1[) . these algorithms mainly deal with the case where / is a convex function even a least square. 
In this paper, unlike all the reviewed algorithms, we propose a Barzilai-Borwein gradient algorithm for 
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solving ^-regularized nonsmooth minimization problems. At each iteration, we approximate / locally by 
a convex quadratic model, where the Hessian is replaced by the multiplies of a spectral coefficient with 
an identity matrix. The search direction is determined by minimizing the quadratic model and taking full 
use of the I i-norm structure. We show that the generated direction is descent which guarantees that there 
exists a positive stcpsizc along the direction. In our algorithm, we adopt the nonmonotone line search of 
Grippo, Lampariello, and Lucidi |24| . which allows the function values to increase occasionally in some 
iteration but decrease in the whole iterative process. The attractive property of the nonmonotone line search 
is that it saves much number of function evaluations which should be the main computational burden in 
large dataset. The method is easily performed, where only the value of objective function and the gradient 
of the smooth term are needed at each iteration. We show that each cluster of the iterates generated by 
this algorithm is a stationary point of F. In this paper, although we mainly consider the £i-regularizer, 
the ^2-norm regularization problem and the matrix trace norm problems can also be readily included in 
our framework. Thus, this broaden the capability of the algorithm. We implement the algorithm to solve 
problem where / is a nonconvcx smooth function from CUTEr library to show its efficiency. Moreover, 
we also run the algorithm to solve £i-regularized least square, and do performance comparisons with the 
state-of-the-art algorithms NESTA, CGD, TwIST, FPC and GPSR. The comparisons results show that the 
proposed algorithm is effective, comparable, and promising. 

We organize the rest of this paper as follows. In Section [51 we briefly recall some preliminary results 
in optimization literature to motivate our work, construct the search direction, and present the steps of our 
algorithm along with some remarks. In Section [31 we establish the global convergence theorem under some 
mild conditions. In Section [4[ we show that how to extend the algorithm to solve ^2-norm and matrix trace 
norm minimization problems. In Section [SJ we present experiments to show the efficiency of the algorithm 
in solving the ^-regularized nonconvex problem and least square problem. Finally, we conclude our paper 
in Section [6] 

2. Algorithm. 

2.1. Preliminary results. First, consider the minimization of the smooth function without the l\- 
norm regularization 

mmf(x). (2.1) 

xER 

The basic idea of Newton's method for this problem is to iteratively use the quadratic approximation qk to 
the objective function f(x) at the current iterate Xk and to minimize the approximation qk- Let / : R" — » K 
be twice continuously diffcrentiable, and its Hessian Gk = V 2 /(xfc) be positive definite. Function / at the 
current Xk is modeled by the quadratic approximation g^, 

f(x k + s) « q k (s) = f(x k ) + Wf{x k ) T s + Ks T G k s, 

where s = x — Xk- Minimizing (j%(s) yields 

x k+ i = x k - G~^ l Vf{x k ), 

which is Newton's formula and Sk = Xk+i — Xk = —G k ~ 1 \7f(xk) is the so-called Newton's direction. 

For the positive definite quadratic function, Newton's method can reach the minimizer with one iteration. 
However, when the starting point is far away from the solution, it is not sure that Gk is positive definite and 
Newton's direction dk is a descent direction. Let the quadratic model of / at Xk+i be 

f(x) « /(xfc+i) + Vf{x k+ i) T {x - x fc+ i) + -{x - x k+ i) T G k+ i(x - x k+ i). 
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Finding the derivative yields 

Vf(x) » V/(a? fc+ i) + G k+1 (x - x k+1 ). 

Setting x = x k , s k = x k+1 - x k , and y k = Vf(x k+1 ) - Vf(x k ) we get 

G k+1 s k » y k . (2.2) 

For various practical problems, the computing cfForts of the Hessian matrices are very expensive, or the 
evaluation of the Hessian is difficult; even the Hessian is not available analytically. These lead to the quasi- 
Newton method which generates a series of Hessian approximations by the use of the gradient, and at 
the same time maintains a fast rate of convergence. Instead of computing the Hessian G k , quasi-Newton 
method constructs the Hessian approximation B k , where the sequence {B k } possesses positive definiteness 
and satisfies 



B k+ is k = y k . (2.3) 

In general, such B k +i will be produced by updating B k with some typical and popular formulae such as 
BFGS, DFP, and SRI. 

Unfortunately, the standard quasi-Newton algorithm, or even its limited memory versions, doesn't scale 
well enough to train very large-scale models involving millions of variables and training instances, which are 
commonly encountered, for example, in natural language processing. The main computational burden of 
Newton-type algorithm is the storage of a large matrix at per-iteration, which may be out of the memory 
capability for a PC. It should be develop a matrix-free algorithm to deal with large-scale problems but 
also belongs to the quasi-Newton framework. For this purpose, it would like to furthermore simplify the 
approximation Hessian B k as a diagonal matrix with positive components, i.e., B k = X k I with an identity 
matrix / and X k > 0. Then, the quasi-Newton condition changes to the form 

Afc+iisfc = y k . 

Multiplying both sides by sj, gives 



,(i) _ s k Vk (94 , 



Similarly, multiplying both sides by yj , yields 

x (2) _ \\yk\\l , 9 .\ 

A fe+1 - (2.5) 

Observing both formulae, it indicates that if sjy k > 0, the matrix X k +il is positive definite, which ensures 
that the search direction — \~ k 1 \/ f(x k ) is descent at current point. 

The formulae (|2.4|) and (|2.5|) were firstly developed by Barzilai and Borwein [2] for the quadratic case 
of /. This method essentially consists the steepest descent method, and adopts the choice of (|2.4[) or (|2.4[) 
as the stepsize along a negative gradient direction. Barzilai and Borwein [5] showed that the corresponding 
iterative algorithm is R-superlinearly convergent for the quadratic case. Raydan [33| presented a globalization 
strategy based on nonmonotonc line search |24j for the general non-quadratic case. Other developments in 
Barzilai and Borwein gradient algorithm can be found in [SJ [TSJ [T7J [T51 [THl [3H HI] . 
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2.2. Algorithm. Due to its simplicity and numerical efficiency, the Barzilai-Borwcin gradient method is 
very effective to deal with large-scale smooth unconstrained minimization problems. However, the application 
of the Barilai-Borwein gradient algorithm to ^-regularized nonsmooth optimization is problematic since the 
regularization is non-diffcrcntiable. In this subsection, we construct an iterative algorithm to solve the l\- 
regularized structured nonconvex optimization problem. The algorithm can be described as the iterative 
form 



Xk+i = Xk + a k dk, 

where ak is a stcpsize, and dk is a search direction defined by minimizing a quadratic approximated model 
oiF. 

Now, we turn to our attention to consider the original problem with £i-regularizer. Since £i-term is not 
diffcrentiable, hence, at current x k , objective function F is approximated by the quadratic approximation 

Qk, 



F(x k + d) =f(x k +d) + fi\\x k + d\\i 

-f(x k ) + Vf(x k ) T d+ ^-\\d\\ 2 2 + fi 



x k 1 + 



\x k 



hd\\, 



Qk(d), 



(2.6) 



where h is a small positive number. The term in [•] can be considered as an approximate Taylor expansion 
of \\x k + d\\i with a small h, and the case h = 1 reduces the equivalent form \\x k + Minimizing (|2.6[) 
yields 



min Q k (d) 
&mmVf(x k ) T d+^\\dg + %\\x k 



hd\\i 



i, 2 



A,, 



min — [V.f(x k ) 1 d+ -±\\d\\l + ^\\x k + 



h 



h 



<^> min - \\Xk + hd- (x k - — Vf(x k )) 



— + hd\\i 



<^ mm 



E {5(4 + M* - (4 - A v .f(xfe))) 2 + ^14 + h*\} 



(2.7) 



where x\, d l , and Vf l (x k ) denote the i-th component of x k , d, and Vf(x k ) respectively. The favorable 
structure of (|2.7I) admits the explicit solution 



hd\ = max • 



4 - f v.H4 

Afe 



M n l x k - IT 

A*. ' J \xi — 



Vf(x k ) 



Hence, the search direction at current point is 



d k 



1 r 

h 



Xk 



Xk 



X 



-V/(z fe ) 



At" 



1 x k 

'°}t— 

> \Xh 



- irW(.Tfe) 



|a*-&V/(z fc )l 



(2.8) 



where | • | and "max" are interpreted as componentwise and the convention • 0/0 = is followed. When 
/t = 0, (|2.8p reduces to d k = —X^ 1 S7f(x k ), i.e., the traditional Barizilai-Borwein gradient algorithm in 
smooth optimization. The key motivation for this formulation is that the optimization problem in Eq. (|2.7[) 
can be easily solved by exploiting the structure of the f r norm. 

Lemma 2.1. For any real vectors a S l n and b £ M. n , the following function L(x) is non- decreasing 

||a + 6x||i - ||a||i 



L(x) 



x G (0, 00). 



(2.9) 
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Proof. Note that 



L{x) 



Ho + teiK-iHi 



|o j + b l x\ - \a l \ 



hence, it reduces to prove that l % (x) is non-decreasing for each i. 

(a) . When a 1 > and a l x + b l > 0. It is clear that l l (x) = b\ 

(b) . When a 1 > and a l a; + 6 l < 0, we have 



P(x) = 

(c). When a 1 < and a l x + fe 1 > 0, we have 

l\x) 



-2a l - b l x 



x 



-2a l 



b\ 



2a l + b l x _ 2a l 
x x 



(d). When a* < and a l x + V > 0, we have l\x) = -b\ 

It is not difficult to see that l l (x) is non-decreasing at each case. Hence, L(x) is non-decreasing. □ 
The following lemma shows that the direction defined by (|2.8|) is descent if dk ^ 0. 
Lemma 2.2. Suppose that X k > and dk is determined by i2.8\) . Then 



F(x k + 6d k ) < F(x k ) + 8 Vf(x k ) T d k + 



H\\x k + hdkWx - At||x fe ||] 



o{0) 0e(0,h], 



and 



(2.10) 



(2.11) 



Proof. By the differentiability of / and the convexity of || x ||i, we have that for any 8 £ (0, ft.] (0/h £ (0,1]), 

F(x k + 8d k ) - F(x k ) = f(x k + 6d k ) - f(x k ) + p\\x k + 6d k \\i - mINIIi 

8 8 
= f(x k + 9d k ) - f(x k ) + pL r{xk + hd k ) + (1 - -)x k ^ - ^\\xk\\i 

8u 9 
< f(x k + 8d k ) - f(x k ) + -£\\x k + ftdfcHi + (1 - t>IK||i - A»||a!fc||i 



8Vf{x k ) T d k +o{8) 



+ hd k \\i - -||xfc||i 
h h 



which is exactly (|2.10l) . 

Noting that d k is the minimizer of (|2.6[) and 8 £ (0, h], from (|2.6|) and the convexity of ||x||i, we have 



V/(x fe ) 1 d k 



, ^ n j 1 1 2 , Mil 31 *! + Mfe||l 



A; 



1141 



\Xk 1 



<8Vf{x k ) T d k + y||fldfc||l + j^Xk + 0hd k \\i - -\\a 



<8Vf(x k ) 1 d fc 



Xj f\\dk\\i + + h 2 d k h + £(1 - ^Hxfcll! - 



Hence, 



(1 - 6)Vf(x k ) T d k + £||x fc + Mfclli - ^\\x k + h'd^ - |(1 - ^Hxfclla < -y (1 - 8 2 )\\d k \\l (2.12) 
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The last three terms of the left side in (|2.12j) can be re-organized as 

j;{\\xk + hd k \\x - -\\x k + fr 2 f4|| i - (1 - — )||as fc ||i| 

a* r n ... [ ||x fc + fe 2 4||i - llgfclli i'i 

= Ti l" ^'l' 1 _ H^ll 1 - |_ £ J ; 

m r ,, I, H I, a \, lkfc + fe 2 4||i - IMi p 

= h l" fe ll x - I' ^H 1 - [ p J J 

h I ""fclli - ll x fclli - r J J 

=^(l-0){||z fe + M fe ||i-|Mi}, (2.13) 
where the inequality is from Lemma 12.11 Combining (|2.12j) with (|2.13|) . it produces 

(i - e)wf{x h y dk + (i - f\\^ + ^h-^h < _ fi)ml (2 . 14) 

Dividing both sides of (|2.14j) by (1 - 6) and noting e (0, h], we get the desirable result (|2.11j) . □ 

When the search direction is determined, a suitable stepsize along this direction should be found to 
determine the next iterative point. In this paper, unlike the traditional Armijo line search or the Wolfe- 
Powell line search, we pay particular attention to a nonmonotone line search strategy. The traditional Armijo 
line search requires the function value to decrease monotonically at each iteration. As a result, it may cause 
the sequence of iterations following the bottom of a curved narrow valley, which commonly occurs in difficult 
nonlinear problems. To overcome this difficultly, a credible alternative is to allow an occasional increase in 
the objective function at each iteration. To easy comprehension of the proposed algorithm, we briefly recall 
the earliest nonmonotone line search technique by Grippo, Lamparicllo, and Lucidi [24j . Let S k £ (0, 1), 
p G (0, 1) and m be a positive integer. The nonmonotone line search is to choose the smallest nonnegative 
integer j k such as the stepsize a k = ap" 1 satisfing 

f(x k + a k d k ) < max f(x k - j ) + 5a k 'Vf{x k ) T d k , (2.15) 

0< j<?n( k ) 

where 

m(0) = and < m(k) < mm{m(k — 1) + 1, m}. 

If m(k) = 0, the above nonmonotone line search reduces to the standard Armijo line search. 

For the ^-regularized nonsmooth problem (|1.1|) . based on Lemma f2.2( the inequality (|2.15|) should be 
modified as 

F(x k + a k d k ) < max F(x k -j) + Sa k A k , (2.16) 

0<j<m( k ) 

where 

A fc = v f { Xk y dk + *\\** + Mf-*\\**\\\ (2 . 17) 

h 

From (|2.1ip . it clear that Afc < — ^Hld&H! < whenever d k ^ 0. Hence, this shows that a k given by (|2.16[) 
is well-defined. 

In light of all derivations above, we now describe the nonmonotone Barzilai-Borwcin gradient algorithm 
(abbreviated as NBBL1) as follows. 
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Algorithm 1. (NBBL1) 



Initialization: Choose xq and constants /i > 0. Constants a > 0, p G (0,1), 5 G (0,1), 

h G (0, 1] and positive integer rh. Set k = 0. 

Step 1. Stop if \\dk\\2 = 0. Otherwise, continue. 

Step 2. Compute d k via H2.8\) . 

Step 3. Compute a k via 12.1 Oil . 

Step 4. Lei x k+1 = x k + a k d k . 

Step 5. Let k = k + 1. Go to Step 1. 

Remark 1. We have shown that if X k > 0, then the generated direction is descent. However, in this 
case, the condition X k > may fail to be fulfilled and the hereditary descent property is not guaranteed any 
more. To cope with this defect, we should keep the sequence {X k } uniformly bounded; that is, for sufficiently 
small A( min ) > and sufficiently large A( max ) > 0, the X k is forced as 

X k — min{A( max ), max{A/c, A( mm )}}. 

This approach ensures that X k is bounded from zero and subsequently ensures that d k is descent at per- 
iteration. 

Remark 2. From Lemma \2.2l it is clear that there exists a constant 9 G (0, h] such that x k + 9d k is a 
descent point in sense of \2.1U\) . Hence, in practical computation, it is suggested to choose the initial stepsize 
as a = h. 

3. Convergence analysis. This section is devoted to presenting some favorable properties of the 
generated direction and establishing the global convergence of Algorithm Q] subsequently. Our convergence 
result utilizes the following assumptions. 

Assumption 1. The level set O = {x : f(x) < f(x Q )} is bounded. 

Lemma 3.1. Suppose that X k > and d k is defined by \2. 8\) with h G (0,1]. Then x k is a stationary 
point of problem M.l)) if and only if d k =0. 

Proof. If dk 7^ 0, then Lemma 12.21 shows that d k is descent direction at x k , which implies that x k is not 
a stationary point of F. On the other hand, if d k = is the solution of (|2.7[) . for any ad G R n with a > 
we have 



aVf(x k ) T d +^\\d\\l + ^\\x k + ahd\\ 1 > Ih^Hl 



Since f(x k + ad) — f(x k ) = aV f(x k ) J d + o(a), this together with (|3.1[) yields 

F'(x k - d) = lim f( Xk + ad ) - + ^ Xk + - ^Mhk 

aio a 

_ lim aS7f(x k ) T d + o(a) + p\\x k + adjji - /4gfcj|i 
a\o a 

>*f\\d\\l + o{a) [/xH^ + adHi-Mll^lli] - [$\\x k + ahd\\i - %\\x k \\ 



> lim 

aio \ a 

>lim -^ Mill 

aiO a 

=0, 



where the second inequality is from Lemma 12.11 Hence, x k is a stationary point of F. □ 
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The proof of the following lemma is similar with the Theorem in |24j . 
Lemma 3.2. Let l(k) be an integer such that 

k — m(k) < l{k) < k and F{xi(u\) = max F{xk-j). 

0<j<m(k) 

Then the sequence {F(x^k))} is nonincreasing and the search direction di(k) satisfies 

Urn a I(fc) ||d I(fc) || 3 = 0. (3.2) 

Proof. From the definition of m(k), we have m(k + 1) < m(k) + 1. Hence 

F(x ;(fe+1) ) = max F(x k +i-j) 

0<j<m(fc+l) 

< max F(x k +i- 3 ) 

0<j<m(fc) + l 

= max{F(i iW ), F(x k+1 )} 

Moreover, by (|2.16p . we have for all k > rh, 



0<j<rn(l(k) — 1) 

= F(a; i ( J ( fe )_i)) + 5ai( fe )_iAi( fe )_i. 

By assumption [T] the sequence {F(xi(k))} admits a limit for k — > oo. Hence, it follows that 

lim a l{k) A l{k) = 0. (3.3) 

On the other hand, from the definition of Afe in f)2.17[) and the inequality (|2.11[) , it is not difficult to deduce 
that 

A, W <-^||4 W ||2<0. 

Combining (|3.3[) . yields 

lim a m \\di {k) \\l = 0, 

which shows the desirable result (|3.2[) . □ 

Theorem 3.3. Let the sequence {x k } and {dk} generated by Algorithm]]^ Then, there exists a subse- 
quence /C such that 

lim R||2=0. (3.4) 

Proof. From [24], it is clear that ([3.2)1 also implies 

lim a fc ||d fe |[ 2 = 0. (3.5) 

k— >oo 
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Now, let i be a limit point of {x k }, and {x k };c 1 be a subsequence of {x k } converging to x. Then by (|3.5[) 
either lim IMfclb = 0, which implies ||d||2 = 0, or there exists a subsequence {x k }ic QC C JC\) such that 

fc— >oo,&E/Ci 

lim dk 7^ and lim = 0. (3-6) 

/c— »oo,/c£/C k— >oc, k£LK. 

In this case, we assume that there exists a constant e > such that 

||4||2>e, Vfce/C. (3.7) 

Since oik is the first value for satisfying f|2 . 16[) , it follows from Step 3 in Algorithm Q] that there exists an 
index fc such that, for all fc > fc and fc e /C, 

F(x k + ^d k ) > max F(x h -j) + <5— A k > F{x k ) + S—A k . (3.8) 

p 0<j<m(k) p p 

Since / is continuous differentiable, by the mean- value theorem on /, we can find there exists a constant 
Ok S (0, 1), such that 

f(x k + —d k ) - f(x k ) = —Vf(x k + 9 k —d k ) T d k . 

P P P 

By combining with (|3.8[l . we have 

Vf{x k +9 k — d k ) d k H ^ > 5A k . (3.9) 

P a-k/p 



Since a = h and a k — > in (|3.6p . we have a k < ph as fc — > oo. It follows from Lemma |2. II that 

< 0. 



p\\x k + tfdkWx - fiWxkWx ^\\ Xk + h d k \\i _ ^ii^Hj 



Subtracting both sides of p.9j) by and noting the definition of Ak, it is clear that 



MlN + - mIN||i A*l|a?fc + hd k\\i - n\\x k \\i 



Vf(x k + 9 k ^d k ) T d k - Vf(x k ) T d k 
P 

>V/(x fe + fc ^4) T d fe - Vf(x k ) T d k + 
P 



a k /p 



>-(l-S)A k 

>(1 - 6)^fl\\d k \\l (3.10) 
Taking the limit as fc 6 /C, fc — > oo in the both sides of (|3.10[) and using the smoothness of /, we obtain 

= Vf(x) T d - Vf(x) T d> (1 - 5)^±\\d\\l 

which implies |rffc||2 — > as fc e /C, fc — > oo. This yields a contradiction because Q3.7j) indicates that ||(ifc||2 is 
bounded. □ 

4. Some extensions. In this section, we show that our algorithm can be readily extended to solve 
^2-riorm and matrix trace norm minimization problems in machine learning; thus, broaden the applicable 
range of our approach significantly. 

Firstly, we consider the ^ 2 - r eg Uiar i za tion problem 

min F(x) = f(x) + p\\x\\ 2 . 



12 



Y. Xiao, S.-Y. Wu and L. Qi 



It is not difficult to deduce that, the search direction d k is determined by minimizing 



1 

mm - 

deK™ 2 



h 2 
x k + hd- (x k - — V/(xfc)) 
Afc 



///( 



||arfc + W||s 



From [2T], the explicit solution is 



a^fe + M fe = max I Xk - T~Vf(xk) - t~, o) 
I Afc 2 A*. J 



xfe - &V/(a*) 



Ffc 



S 2 



rffc = 



Xfc — max • 



£fc - — V/(x fc ) 
Afc 



2 Afc 



,o}. 



Xfe - £V/0r fe ) 



a*- £ V/(z fc )|| 2 



Now, we consider the matrix trace norm minimization problem 



min F(X) = f(X) + n\\X\\ tl , 



A'ei 



(4.1) 



where the functional ||-X"||* is the trace norm of matrix X, which is defined as the sum of its singular values. 
That is, assume that X has r positive singular values of a\ > <T2 > . . . > oy > 0, then ||X||* = a i- The 

matrix trace norm is alternatively known as the Schatten ^i-norm, Ky Fan norm, and nuclear norm [34] . 
Such problem has been received much attention because it is closely related to the affine rank minimization 
problem, which has appeared in many control applications including controller design, realization theory and 
model reduction. 

As it has been done in the previous sections, we can readily reformulate (|2.6p as the following quadratic 
model to determine the search direction, 



1 

mm — 

Dei mx » 2 



X k + hD- (X k - Av/(X fc ))"' 
Afc 



\\X k + hD\ 



(4.2) 



2 Afe 

To get the exact solution of (|4.2[) . we now consider the singular value decomposition (SVD) of a matrix 
Y G M" IX " with rank r, 

Y = UZV T , E = diag({a i } 1 <i< r ), 

where U and V are m x r and r x n matrices respectively with orthonormal columns, and the singular value 
ai is positive. For each r > 0, we let 

V T (Y) = UV T (Z)V T , X> T (£) = diag([a< - r] + ), 

where [•] + = max{0, •}. It is shown that T> T (Y) obeys the following nuclear norm minimization problem [5], 
i.e., 



V T (Y) = argminr||X 



x 



l\\X-Y\ 



(4.3) 



Comparing f|4. 2[) to (|4.3[) . wc deduce that 



X k + hD k = UV^ h/Xk (E)l/ T and V^ h/Xk (E) = diag([a 4 - ^] + 

Afe 



or, equivalcntly, 



Di 



1 



Xk-UV^^V 1 



Subsequently, it is easily to derive the nonmonotonc Barzilai and Borwcin gradient algorithmic framework 
for solving ^2-norm and matrix trace norm regularization problems. 
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5. Numerical experiments. In this section, we present numerical results to illustrate the feasibility 
and efficiency of NBBL1. We partition our experiments into three classes based on different types of /. In 
the first class, we perform our algorithm to solve ^-regularized nonconvex problem. In the second class, we 
test our algorithm to solve ^-regularized least squares which mainly appear in compressive sensing. In the 
third class, we compare some state-of-the-art algorithms in compressive sensing to show the efficiency of our 
algorithm. All experiments are performed under Windows XP and Matlab 7.8 (2009a) running on a Lenovo 
laptop with an Intel Atom CPU at 1.6 GHz and 1 GB of memory. 

5.1. Test on ^i-regularized nonconvex problem. Our first test is performed on a set of the non- 
convex unconstrained problems from the CUTEr |TS] library. The second-order derivatives of all the selected 
problems are available. Since we are interested in large problems, we only consider the problems with size 
at least 100. For these problems, we use the dimensions that is admissible of the "double large" installation 
of CUTEr. The algorithm stops if the norm of the search direction is small enough; that is, 

\\dkh<toh. (5.1) 

The iterative process is also stopped if the number of iterations exceeds 10000 without achieving convergence. 

In this experiment, we take tol\ = 10~ 8 , h = 1, A( min ) = 10~ 20 , A( max ) = 10 20 . In the line search, 
we choose cto = 1, p = 0.35, 5 = 10~ 4 and to = 5. We test NBBL1 with different parameter values 
/i = {0,1/4,1/2,2}. The numerical results are presented in Table HTT1 which contains the name of the 
problem (Problem), the dimensions of the problem (Dim), the number of iterations (Iter), the number of 
function evaluations (Nf), the CPU time required in seconds (Time), the final objective function values 
(Fun), the norm of the final gradient of / (Normg), and the norm of final direction (Normd). 

From Table [57X1 we see that NBBL1 works successfully for all the test problems in each case. Particularly, 
NBBL1 always produces great accuracy solutions within little consuming time. The proposed algorithm re- 
quires large number of iterations for some special problems, such as problems FLETCHER, N0NCVXU2, BR0YDN7D 
with parameter fj, = 0, problems FLETCHER and BR0YDN7D with /i = 0.25, problem FLETCHER with /! = 0.5, 
and VARDIM with fi = 2. However, if lower precision is permitted, the number can be decreased dramatically. 
The first part of Table IBTTI presents the numerical results of NBBL1 for solving a smooth nonconvex mini- 
mization problem without any regularization. From the last second column in this part, we observe that the 
norm of the final gradient is sufficiently small. The important observation verifies that the proposed algo- 
rithm is very efficient to solve unconstrained smooth minimization problems. It is not a pleasant supervise, 
because our algorithm reduces to the well-known nonmonotono Barzailai-Borwein gradient of Raydan |33j 
in this case. 

5.2. Test on ^-regularized least square. Let x be a sparse or a nearly sparse original signal, 
A S R mx ™ (m < n) be a linear operator, ui £ M. m be a zero-mean Gaussian white noise, and b € K m be an 
observation which satisfies the relationship 

b = Ax + uj. 

Recent compressive sensing results show that, under some technical conditions, the desirable signal can be 
reconstructed almost exactly by solving the £i-regularizcd least square (|1.2[) . In this subsection, we perform 
two classes of numerical experiments for solving (|1.2[) by using the Gaussian matrices as the encoder. In the 
first class, we show that our algorithm performs well to decode a sparse signal, while in the second class we 
do a series of experiments with different h to choose the best one. We measure the quality of restoration x* 
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Table 5.1 
Test result for NBBL1 



Problem 


Dim 




Iter 


Nf 


Time 


Fun 


Normg 


Normd 


VAR.DIM 


1000 


0.0 


49 


94 


0.48 


3.2506c-26 


3.6059c-13 


2.5893c-09 


FLETCHER 


100 


0.0 


1217 


1983 


3.75 


3.0113e-10 


2.2576c-05 


9.9505c-09 


COSINE 


10000 


0.0 


51 


350 


23.41 


_Q QQQnp + 03 


2 5387p-03 


4.4188c-09 


SINQUAD 


1000 


0.0 


180 


908 


10.22 


6.4479c-05 


4.9743c-05 


6.8482c-09 


GENROSE 


200 


0.0 


323 


646 


0.80 


l.OOOOc+00 


1.3870c-05 


9.9488c-09 


WOODS 


1000 


0.0 


322 


579 


3.00 


9.9104e-13 


7.2693e-06 


7.5155c-09 


NONCVXU2 


200 


0.0 


4987 


8476 


21.17 


4.6373e+02 


2.0726e-07 


7.0300e-09 


BROYDN7D 


500 


0.0 


1305 


2402 


10.38 


3.8234e+00 


9.3966e-07 


9.2037e-09 


CHAIWOO 


1000 


0.0 


757 


1335 


9.80 


1.0000e+00 


8.OOO60-O6 


4.4747o-09 


VAR.DIM 


1000 


0.25 


49 


94 


0.63 


2 5000p+02 


6 8487p+00 


6 4503p-09 


FLETCHER. 


100 


0.25 


5042 


8657 


15.83 


2.4497e+01 


2.5000c+00 


9.7495c-09 


COSINE 


10000 


0.25 


47 


108 


20.33 






4.8382c-09 


SINQUAD 


1000 


0.25 


46 


c)2 


1.95 


2.8084c-01 




2.2567c-13 


GENROSE 


200 


0.25 


g 


57 


0.06 


1.9846c+02 


3.5267c+00 


3.7742e-09 


WOODS 


1000 


0.25 


645 


1527 


6.69 


2.4911c+02 


7.9057e+00 


7.0300e-09 


NONCVXU2 


200 


0.25 


998 


1957 


4.78 


5.6230c+02 


3.3990c+00 


8.2357e-09 


BROYDN7D 


500 


0.25 


1314 


2366 


10.95 


8.9609e+01 


5.5902e+00 


8.1753c-09 


CHAIWOO 


1000 


0.25 


435 


746 


6.11 


2.5055c+02 


7.9057c+00 


6.6783c-09 


VAR.DIM 


1000 


0.5 


448 


1540 


4.80 


4.8920c+02 


1.4141c+01 


6.6304c-09 


FLETCHER. 


100 


0.5 


2654 


4513 


8.00 


4.8953c+01 


5.0000c+00 


8.4947c-09 


COSINE 


10000 


0.5 


21 


66 


8.45 


1.0042c+02 


4 QQQ^o-r-DI 


q ?^7Dp-f)Q 


SINQUAD 


1000 


0.5 


36 


67 


1.16 


4.2508c-01 


7.0724c-01 


7.0416c-09 


GENROSE 


200 


0.5 


g 


60 


0.08 


1.9887c+02 


7.0534c+00 


5.5793c-09 


WOODS 


1000 


0.5 


643 


1321 


6.03 


4.9644e+02 


1.5811e+01 


2.9571e-09 


NONCVXU2 


200 


0.5 


595 


906 


2.59 


5.7366e+02 


6.7493O+00 


1.5667e-09 


BROYDN7D 


500 


0.5 


1020 


2096 


8.30 


1.7207o+02 


1.1180e+01 


8.8339e-09 


CHAIWOO 


1000 


0.5 


1265 


2109 


17.00 


5.0253o+02 


1. 58110+01 


5.3247o-09 


VAR.DIM 


1000 


2.0 


1506 


4816 


14.81 


1. 7511e+03 


6. 2435o+01 


2.4783c-09 


FLETCHER 


100 


2.0 


996 


1721 


3.13 


2.4344e+02 


2.0000o+01 


2.8322e-09 


COSINE 


10000 


2.0 


2 


3 


1.00 


9.9990e+03 


O.OOOOc+00 


O.OOOOc+00 


SINQUAD 


1000 


2.0 


67 


111) 


2.30 


8.6555e-01 


2.8291c+00 


5.0413c-ll 


GENROSE 


200 


2.0 


2 


3 


0.03 


2.0000e+02 


2. 8213o+01 


0.0000O+00 


WOODS 


1000 


2.0 


102 


283 


1.05 


3.3572e+03 


6. 3246o+01 


3.8964e-09 


NONCVXU2 


200 


2.0 


251 


825 


1.50 


7.3779e+02 


1.7429o+01 


9.7233e-09 


BROYDN7D 


500 


2.0 


152 


455 


1.09 


5.0216e+02 


6. 74580+00 


5.0308e-09 


CHAIWOO 


1000 


2.0 


927 


1634 


13.45 


1.9739c+03 


6. 3246o+01 


9.5918o-09 



by means of the relative error to the original signal x\ that is 



In the first test, we use a random matrix A with independent identically distributions Gaussian entries. 
The uj is the additive Gaussian noise of zero mean and standard deviation a. Due to the storage limitations 
of PC, we test a small size signal with n = 2 11 , m = 2 9 . The original contains randomly p = 2 6 non-zero 
elements. Besides, we also choose the noise level a = 10 -3 . The proposed algorithm starts at a zero point 
and terminates when the relative change of two successive points are sufficient small, i.e., 

"^r^V" 2 < toh. (5.3) 
iFfc-illa 

In this experiment, we take tol = 10~ 4 , h = 10~ 2 , A( min ) — 10 -30 , A( max ) = 10 30 . In the line search, wc 
choose <So = 10 -2 , p — 0.35, 8 = 10~ 4 and m = 5. The original signal, the limited measurement, and the 
reconstructed signal are given in Figure HTT1 
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Original (n = 2048, number of nonzeros = 64) 



Observation 



500 



1000 



1500 



2000 




500 



NBBL1 (RelErr= 2.03 %) 



500 



1000 



1500 



2000 



Fig. 5.1. Left: original signal with length 4096 and 64 positive non-zero elements; Middle: the noisy measurement with 
length 512; Right: recovered signal by NBBL1 (red circle) versus original signal (blue peaks). 



Comparing the left plot to the right one in Figure 15.11 we clearly see that the original sparse signal 
is restored almost exactly. We see that all the blue peaks are circled by the red circles, which illustrates 
that the original signal has been found almost exactly. All together, this simple experiment shows that our 
algorithm performs quite well, and provides an efficient approach to recover large sparse non-negative signal. 

We have clearly known that the last term in the approximate quadratic model (|2.6[) is equivalent to 
\\%k + d\\\ exactly when h = 1. Next, we provide evidence to show that other values can be potentially and 
dramatically better than h = 1. We conduct a series of experiments and compare the performance at each 
case. In our experiments, we set all the parameters values as the previous test except for n = 2 10 . We 
present, in Figure [5~2"1 the impact of the parameter h values on the total number of iterations, the computing 
time, and the quality of the recovered signal. In each plot, the level axis denotes the values of h from 0.01 
to 1 in a log scale. 




h value h value h value 



Fig. 5.2. Performance of NBBL1: number of iterations (left), computing time (middle) and final relative error (right). 
In each plot, the horizontal axis represents the value of h in log scale. 



In Figure [5~121 the number of iterations, the computing time and the quality of restorations are greatly 
influenced by the h values. Generally, as h increases, NBBL1 always has good performance. The right plot 
clearly demonstrates that the relative error decreases dramatically at the very beginning and then becomes 
slightly after 0.2. However, the quality of restoration can not be improved any more after 0.7. On the other 
hand, the left and the middle plots show that the number of iterations and the computing time slightly 
increase after h = 0.8. Taking three plots together, these plots verify that the performance of NBBL1 is 
sensitive to the h values, and the value h £ [0.7, 0.8] may be the better choice. 

5.3. Comparisons with NESTA-Ct, GPSR-BB, CGD, TwIST and FPC-BB. The third class 
of the experiment is to test against several state-of-the-art algorithms which are specifically designed in 
recent years to solve ^i-regularized problems in compressive sensing or linear inverse problems. It is difficult 
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to compare each algorithm in a very fair way, because each algorithm is compiled with different parameter 
settings, such as the termination criterions, the staring points, or the continuation techniques. Hence, as 
usual, in our performance comparisons, we run each code from the same initial point, use all the default 
parameter values, and only observe the convergence behavior of each algorithm to attain a similar accuracy 
solution. 

uses Nesterov's smoothing technique [37J and gradient method [3D] to solve basis pursuit de- 
noising problem. The current version is capable of solving ^-norm regularization problems with different 
types including (|1.2[) . In this experiment, we test NESTA with continuation (named NESTA-Ct) for com- 
parison, where this algorithm solves a sequence of problems (|1.2[) by using a decreasing sequence of values 
of \i. Additionally, NESTA-Ct uses the intermediated solution as a warm start for the next problem. In 
running NESTA, all the parameters are taken as default except TolVar is set to be l.e — 5 to obtain similar 
quality solutions with others. 

gpsr-beJI (Gradient Projections for Sparse Reconstruction) [35] reformulates the original problem (|1.2[) 
as a box-constrained quadric programming problem (|1.6[) by splitting x = u — v. Figueiredo, Nowak and 
Wright use a gradient projection method with Barziali-Borwein steplength [2] for its solution. Moreover, the 
nonmonotone line search |24] is also used to improve its performance. For the comparison with GPSR-BB, 
we use its continuation variant and set all parameters as default. 

The well-known CGe| uses gradient algorithm to solve (|1.5p in order to obtain the search direction 
d\ = ze 1 in i e J, where J is a nonempty subset of {1, ...,n}, and choose the index subset J a Gauss- 
southwcll rule. The iterative process Xk+i — %k + ojfcdfc continues until some termination critera are met, 
where dl = with i ^ J and the stcpsize a k by using a Armijo rule. In running CGD, we use the code CGD 
in its Matlab package, and set all the parameter as default except for init=2 to start the iterative process 
at xq = A T b. 

TwIStS is a two-step 1ST algorithm for solving a class of linear inverse problems. Specifically, TwIST 
is designed to solve 

mmJ(u) + ^\\Au-f\\l (5.4) 

where A is a linear operator, and is a general regularizer, which can be either the £i-norm or the TV. 
The iteration framework of TwIST is 

u k+ i = (1 - a)u fe _i + (a - S)u k + S^^k), 

where a, 6 > are parameters, = u k + A T (/ — Au k ) and 

= argmin J(u) + £||u - &|||. (5.5) 

We use the default parameters in TwIST and terminate the iteration process when the relative variation of 
function value falls below 10~ 4 . 

FPC@ is the fixed-point continuation algorithm to solve the general ^-regularized minimization problem 
(|1.1[) . where / is a continuous differentiable convex function. At current x k and any scalar r > 0, the next 
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iteration is produced by the so-called fixed point iteration 

x k +i = sgn(x k -TVf(x k ))max.{\xk -rVf(x k )\ -/ir,0}, 

where "sgn" is a componentwise sign function. In order to obtain a good practical performance, a continua- 
tion approach is also augmented in FPC. Moreover, the FPC is further modified by using Barzilai-Borwein 
stepsize (code FPC-BB in Matlab package FPC_v2). The continuation and Barzilai-Borwcin stepsize tech- 
niques make FPC-BB faster than FPC. In running of FPC-BB, we use all the default parameter values except 
we set xtol = le-5 to stop the algorithm when the relative change between successive points is below xtol. 

In this test, A is a partial discrete cosine coefficients matrix (DCT), whose to rows are chosen randomly 
from the n x n DCT matrix. Such encoding matrix A does not require storage and enables fast matrix- 
vector multiplications involving A and A T . Therefore, it is able to be used to test much larger size problems 
than using Gaussian matrices. In NBBL1, we take toh = 10~ 4 , h = 0.8, A( min ) = 10~ 30 , A( max ) = 10 30 . 
In the line search, we choose fio = 0.8, p — 0.35, S = 10~ 5 and m = 5. In this comparison, we let 
n = 2 12 , to = f loor(n/4). The original signal x contains p = f loor(m/6) number of nonzero components, 
where floor is a Matlab command used to round an element to the nearest integers towards minus infinity. 
Moreover, the observation b is contaminated by Gaussian noise with level a = le — 3. The goal is to 
use each algorithm to reconstruct x from the observation b by solving (|1.2[) with /j, = 2 -8 . All the tested 
algorithms start at xq = A T b and terminate with different stopping criterions to produce similar quality 
resolutions. To specifically illustrate the performance of each algorithm, we draw four figures to show their 
convergence behavior from the point of objective function values and relative error as the iteration numbers 
and computing time increase, which given in Figure [5731 

From the top plots in Figure I5~3l NBBL1 usually decreases relative errors faster than NESTA-Ct, CCD 
and GPSR-BB throughout the entire iteration process, and meanwhile requires less number of iterations. 
The top right plot shows that TwIST needs less steps than NBBL1 to obtain similar level of relative error. 
However, TwIST is much slower because it has to solve a de-noising subproblem (|5.5[) at each iteration. 
Unfortunately, NBBL1 needs further improvement to challenge the well-known code FPC-BB. We now turn 
our attention to observe the function values behavior of each algorithm. Similarly, NBBL1 is superior to 
NEST-Ct, CCD, GPSR-BB and TwIST from the computing time points of view. FPC-BB reaches the lowest 
function values at the very beginning, and then starts to increase it to meet nearly equal final values at the 
end. In this test, CGD appears to be much slower than the others, because it is sensitive to the choice of 
starting points. If CGD starts at xo = with all the other settings unchanged, its performance should be 
significantly improved |40j . Taking everything together, from the limited numerical experiment, we conclude 
that NBBL1 provides an efficient approach for solving ^i-regularized nonsmooth problem and is competitive 
with or performs better than NESTA-Ct, GPSR-BB, CGD, TwIST and FPC-BB. 

6. Conclusions. In this paper, we proposed, analyzed, and tested a new practical algorithm to solve 
the separable nonsmooth minimization problem consisting of a ^i-norm regularized term and a continu- 
ously diffcrcntiablc term. The type of the problem mainly appears in signal/image processing, compressive 
sensing, machine learning, and linear inverse problems. However, the problem is challenging due to the 
non-smoothness of the rcgularization term. Our approach minimizes an approximal local quadratic model to 
determine a search direction at each iteration. The search direction reduces to the classic Barzilai-Borwein 
gradient method in the case of /i = 0. We show that the objective function is descent along this direction 
providing that the initial stepsize is less than h. We also establish the algorithm's global convergence theorem 
by incorporating a nonmonotone line search technique and assuming that / is bounded below. Extensive 
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experimental results show that the proposed algorithm is an effective toll to solve ^-regularized nonconvex 
problems from CUTEr library. Moreover, we also run our algorithm to recover a large sparse signal from its 
noisy measurement, and numerical comparisons illustrate that our algorithm outperforms or is competitive 
with several state-of-the-art solvers which specifically designed to solve ^-regularized compressive sensing 
problems. 

Unlike all the existing algorithms in this literature, our approach uses an linear model to approximate 
\\%k + d\\i for computing the search direction with a small scalar h; that is 

,i . ,„ _ \\x k + Mli - Ikfelli 

\\Xk + d i ~ \\x k i + . 

h 

Although the equations may hold exactly in the case of h = 1 , a series of numerical experiments show that 
h G [0.7, 0.8] may produce better performance with suitable experiment settings. This approach is distinctive 
and novel; therefore, it is one of the important contributions of this paper. As we all know, the nonmonotone 
Barzilai-Borwein gradient algorithm of Raydan |33j is very effective for smooth unconstrained minimization, 
and its remarkable effectiveness in signal reconstruction problems involving ^-regularized problems has not 
been clearly explored. Hence, our approach can be considered as a modification or extension, to broaden the 
university of [33) . Moreover, the numerical experiments illustrate that our approach performs comparable 
to or even better than several state-of-the-art algorithms. Surely, this is the numerical contribution of our 
paper. Although the proposed algorithm needs further improvement to challenge the well-known code FPC- 
BB, the enhancement of it to deal with non-convex problems is noticeable. Our algorithm is readily to solve 



Nonmonotono Barzilai-Borwcin Gradient Algorithm for ^i-Regularized Nonsmooth Minimization 



19 



the £i-regularized logistic regression, the £ 2 - n orm and matrix trace norm minimization problems in machine 
learning. However, we do not test them in this paper. This should be interesting for further investigations. 
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